



## Low Power Design

#### Volker Wenzel on behalf of Prof. Dr. Jörg Henkel Summer Term 2016

#### CES – Chair for Embedded Systems



ces.itec.kit.edu





## **Overview Low Power Design Lecture**



- Introduction and Energy/Power Sources (1)
- Energy/Power Sources(2): Solar Energy Harvesting
- Battery Modeling Part 1
- Battery Modeling Part 2
- Hardware power optimization and estimation Part 1
- Hardware power optimization and estimation Part 2
- Hardware power optimization and estimation Part 3
- Low Power Software and Compiler
- Thermal Management Part 1
- Thermal Management Part 2
- Aging Mechanisms in integrated circuits
- Lab Meeting

## Overview for today



- power consumption in HW
- operator scheduling

Overview



- Levels of abstraction
  - system
  - RTL
  - gate
  - transistor
- Challenges
  - optimize (ie. minimize for low power)
  - design /co-design (synthesize, compile, ...)
  - estimate and simulate



## Generic HW synthesis flow





- System-Level Design
- High-Level Synthesis
- Logic Synthesis
- Layout Synthesis

## Low power HW design flow



System-level

- Energy/power needs to be analyzed/optimized at each level of abstraction
- Necessitates appropriate • models for each level



(src.: [Anand98])



$$P_{avg} = P_{sw. cap.} + P_{short - circuit} + P_{leakage} + P_{static}$$

## • In general, four components:

- P<sub>sw.cap</sub> switching capacity power
- P<sub>short-circuit</sub> short-circuit power
- P<sub>leakage</sub> leakage power
- P<sub>static</sub> static power

(src.: [Anand98])



## Switching capacity power



- Caused by charging/discharging of parasitic capacitances
  - C<sub>L</sub>: cumulative parasitic capacitance
  - N: expected # of transitions per clock cycle
  - f: clock frequency

$$P_{sw.cap} = \frac{1}{2} C_L V_{dd}^2 N f$$





- aka stray capacitance
- exists between all electronic components due to proximity
- exist between conductors
- problematic in high frequency circuits

## Short-circuit power



- Caused by direct supply-to-ground path
  - K: constant (transistor size, technology)
  - $V_T$ : threshold voltage
  - т: input rise/fall time

$$P_{short-circuit} = K (V_{dd} - 2V_T)^3 \tau N f$$



Leakage power



$$P_{leakage} = (I_{subthreshold} + I_{oxide} + I_{diode}) \cdot V_{dd}$$

- I<sub>diode</sub>: diodes formed between diffusion region and substrate
- I<sub>oxide</sub>: electrons tunneling through the gate oxide
  - drops exponenitally with gate length





- not relevant in CMOS circuits
- Note: in some literature leakage power is denoted as "static power"
- relevant in
  - other logic families
  - some nMOS circuits where there is a constant path supply-to-ground (e.g. ECL)

## Breakdown of power consumption in HW



- "Leakage power will dominate in future ( <100nm) silicon technologies"
- one means to reduce leakage 100 300 Gate length power is to deploy dielectrics with a high k-value 250 power dissipation Dynamic 1 gate length (nm) power Possible trajectory 0.01 if high-k dielectrics reach mainstream Sub-Physical o production Normalized tot threshold leakage 0.0001 50 Gate-oxide leakage 0.0000001 2020 1990 1995 2000 2005 2010 2015 (src.: [Kim])



 http://www.intel.com/content/dam/www/public/us/en/docu ments/pdf/foundry/mark-bohr-2014-idf-presentation.pdf

### Hardware synthesis for low power



- Considered here: high-level synthesis (HLS) e.g.:
  - Operator scheduling
  - Module selection

. . .

- Glitch power reduction
- State transition reduction



Scheduling (in the context of high-level synthesis)

- assigns operations in the behavioral description to control steps or controller states.
- determines cycle-by-cycle behavior (i.e. sequence in which operations are performed)
- determines the sequence in which the various operations of the behavioral description are performed
- dictates which operations and variables can share the same functional units and registers.
- can be used to enable resource sharing for low power by ensuring that correlated variables and operations with correlated operands are appropriately sequenced so that they can share the same resources

#### Some repetition from ESI:

- multicycling (operation can take more than one cycle)
- chaining (multiple operations executed in one cycle)

(src.: [Anand98])



**Scheduling** can be performed so as to enable maximum resource sharing between operations that belong to instances of the same computational pattern, resulting in maximal exploitation of regularity during resource sharing

**Scheduling** can be used to distribute the slacks or mobilities of various operations in the DFG appropriately so that some operations may be performed using slower, more energy-efficient functional units. Thus, scheduling has an impact on the power trade-offs through module selection

**Scheduling** determines the distribution of operations over time, and hence affects the profile of the power consumption in the implementation over time (control steps or clock cycles). **Reducing** peak power is important due to packaging, cooling, and reliability considerations. The effect of scheduling on peak power will be illustrated later.

(src.: [Anand98])

Recap: Delay vs.  $V_{dd}$ 





$$P_{dyn} \sim V_{dd}^2 C_{load} f_{switch}$$

#### **Problem:** How to assign voltages to functional units?

(src.: [Sarraf95])

Normalized Delay vs Vdd





- DFG: directed acyclic graph G = (V,E)
- v<sub>i</sub> must preced v<sub>j</sub> : directed edge from v<sub>i</sub> to v<sub>i</sub>

(src.: [Sarraf95])

Volker Wenzel



vi

vj



"A critical path of a system is defined as the path in the DFG,  $\{v_1, v_2, ..., v_k\}$ , such that the summation of the latencies of the nodes in the path is maximal among all the paths of the DFG. The sum  $C_{o}$  is termed as the critical path length"



## **Motivation**





Behavioral Synthesis = High Level Synthesis



- Step 1: Initialization
- Step 2: Computation of Slack
- Step 3: Maximal slack value
- Step 4: Dual graph Hp
- Step 5: Weight Assignment
- Step 6: Longest weighted path
- Step 7: Reassigning voltages to nodes in the longest path
- Step 8: Goto Step 2

## Step 1 : Initialization



Step 1: Initialization Each of the nodes,  $v \in V$ , in G(V, E) is originally assigned a voltage  $V_c$  (also denoted by  $V_{c_0}$ ),  $\tau(v) = V_c$ . d(v) value is therefore initialized.







Step 2: Computation of slack Using depth first search calculate l(v) values for each of the nodes  $v \in V$  and hence obtain  $s(v) = kt_c - l(v)$ .





Step 3: Maximal slack value Identify the maximum slack value  $s^*$  in the graph G and all the nodes, v, such that  $s(v) = s^*$ . If the maximal slack value  $s^* = 0$  terminate the algorithm; we have obtained an optimal voltage assignment. The set of nodes with maximal slack value  $s^*$ , P, induces a subgraph  $G_p(P, E')$  in G.

Assume: ktc = 10ns



#### 5V, 2ns



## Step 4: Dual graph $H_p$ A dual graph $H_p(P, E_p)$ is obtained for the graph $G_p(P, E')$ : Obtain a Depth-First Search (DFS) ordering of the graph $G_p$ . Let D(v) be the order in which the node v is visited during the DFS. The dual graph, $H_p$ , is constructed on the node set P. If $u, v \in P$ and there does not exist a path from v to u and from u to v in $G_p$ , then if D(u) > D(v) then $(u, v) \in E_p$ .

$$C = Q = (c, g, e, f)$$

$$D(c) = 4$$

$$D(g) = 1$$

$$D(f) = 2$$
DFS order  $D(e) = 3$ 

$$W(c) = (V_c^2 - V_{c_1}^2)$$

$$W(e) = (V_c^2 - V_{c_1}^2)$$

$$W(e) = (V_c^2 - V_{c_1}^2)$$

$$W(f) = (V_c^2 - V_{c_1}^2)$$

$$H_p = W(g) = (V_c^2 - V_{c_1}^2)$$



Step 5: Weight assignment  
If a node 
$$v \in P$$
 is assigned a voltage  $\tau(v) = V_{c_k}$   
then assign a weight  $W(v) = (V_{c_k}^2 - V_{c_{k+1}}^2)$  in  $H_p$ .

$$C = (c, g, e, f)$$

$$D(c) = 4$$

$$D(g) = 1$$

$$D(f) = 2$$

$$D(e) = 3$$

$$W(c) = (V_c^2 - V_{c_1}^2)$$

$$W(e) = (V_c^2 - V_{c_1}^2)$$

$$W(e) = (V_c^2 - V_{c_1}^2)$$

$$H_p = W(g) = (V_c^2 - V_{c_1}^2)$$

$$P_{dyn} \sim V_{dd}^2 C_{load} f_{switch}$$



#### Step 6: Longest weighted path

Obtain a longest weighted path in  $H_p$ . The longest weighted path in a directed acyclic graph is defined as the path in the graph which has the

maximum total node weight and it can be obtained using a single *breadth first search*.

 $G_n$ D(e) = 3 $W(c) = (V_c^2 - V_{c_1}^2)$  $W(e) = (V_c^2 - V_{c_1}^2)$  $W(f) = (V_c^2 - V_{c_1}^2)$  $W(g) = (V_c^2 - V_{c_1}^2)$  $H_p$ 



Step 7: Reassigning voltages to nodes in the longest path

Reassign voltages to nodes in the longest weighted path obtained in the previous step. If a node in the longest path, v, has a prior voltage assignment  $\tau(v) = V_{c_k}$  then change this assignment to  $\tau(v) = V_{c_{k+1}}$ . The new assignment of voltages to nodes in P changes the delay values (d(v) values) for nodes.

D(e) = 3 $W(c) = (V_c^2 - V_{c_1}^2)$  $W(e) = (V_c^2 - V_{c_1}^2)$  $W(f) = (V_c^2 - V_{c_1}^2)$  $W(g) = (V_c^2 - V_{c_1}^2)$ 

Step 8



• Goto Step 2



## **Theorem 1** Given a set of allowable voltages S and data flow graph G(V, E), the above algorithm produces a mapping $\tau: V \to S$ that minimizes $\sum_{v_i \in V} \tau(v_i)^2$ .

$$P_{dyn} \sim V_{dd}^2 C_{load} f_{switch}$$

no proof in this lecture!



# **Theorem 2** The time complexity of the algorithm is $O(kn^2)$ , where $kt_c$ is the timing constraint and n is the number of nodes in G.





#### Experimental Setup: Simulation using High-Level Synthesis Benchmark Tools

| Bench-<br>mark | $\begin{array}{c} \text{Timing} \\ \text{Constraint } k \end{array}$ | Power<br>using 5V | Power<br>using 5V, 3V | x1<br>% reduc.                             | Avg. %<br>reduc. | Power using<br>5V, 3V, 2.4V | x2<br>% reduc.                              | Avg. %<br>reduc. |
|----------------|----------------------------------------------------------------------|-------------------|-----------------------|--------------------------------------------|------------------|-----------------------------|---------------------------------------------|------------------|
|                | 4 †                                                                  | 275               | 195                   | 29.1                                       |                  | 195                         | 29.1                                        |                  |
| Diffeq         | <u>5</u><br>6                                                        | $\frac{275}{275}$ | $\frac{179}{147}$     | $\frac{34.91}{46.55}$                      | 40.73            | 172.52<br>130.8             | $\frac{37.27}{52.44}$                       | 44.56            |
|                | 7                                                                    | 275               | 131                   | 52.36                                      | 1                | 111.56                      | 59.43                                       | 1                |
| FIR            | 9 †<br>10                                                            | $525 \\ 525$      | $\frac{349}{317}$     | 33.53<br>39.62                             |                  | 326.32<br>287.84            | $37.84 \\ 45.17$                            | -                |
|                | 11                                                                   | 525               | 301                   | 42.67                                      | 40.38            | 265.36                      | 49.46                                       | 46.4             |
|                | 12<br>8 †                                                            | $\frac{525}{700}$ | $\frac{285}{604}$     | $\frac{45.71}{13.71}$                      |                  | 246.12<br>584.56            | $\begin{array}{r} 53.12\\ 16.49\end{array}$ |                  |
| AR-Lattice     | 9                                                                    | 700               | 540                   | 22.86                                      |                  | 520.56                      | 25.63                                       | 1                |
| Filter         | 10<br>12                                                             | 700<br>700        | $\frac{476}{412}$     | $\begin{array}{r} 32.0\\ 41.14\end{array}$ | 27.43            | 456.56<br>392.56            | 34.77<br>43.92                              | 30.20            |
|                | 15 †                                                                 | 850               | 690                   | 18.82                                      |                  | 677.04                      | 20.35                                       |                  |
| EWFilter       | 16<br>17                                                             | 850<br>850        | 642<br>610            | $\frac{24.47}{28.24}$                      | 25.88            | 629.04<br>590.56            | $\frac{25.99}{30.52}$                       | 27.88            |
|                | 18                                                                   | 850               | 578                   | $\frac{28.24}{32.00}$                      | 20.88            | 555.32                      | 34.67                                       | 21.00            |

Table 1: Power Consumption Results for smaller Timing Constraintst: Corresponds to the longest path length for the DFG.



| Bench-     | Timing         | Power      | Power        | x1                     | Avg. % | Power using  | x2                     | Avg. % |
|------------|----------------|------------|--------------|------------------------|--------|--------------|------------------------|--------|
| mark       | Constraint $k$ | using $5V$ | using 5V, 3V | % redu <mark>c.</mark> | reduc. | 5V, 3V, 2.4V | % re <mark>duc.</mark> | reduc. |
|            | 8              | 275        | 99           | 64                     |        | 82.8         | 69. <mark>89</mark>    |        |
| Diffeq     | 12             | 275        | 99           | 64                     | 64     | 63.36        | 76. <mark>96</mark>    | 73.43  |
|            | 18             | 525        | 189          | 64                     |        | 150.12       | 71. <mark>41</mark>    |        |
| FIR        | 27             | 525        | 189          | 64                     | 64     | 120.96       | 76. <mark>96</mark>    | 74.19  |
| AR-Lattice | 16             | 700        | 252          | 64                     |        | 232.56       | 66. <mark>77</mark>    |        |
| Filter     | 24             | 700        | 252          | 64                     | 64     | 161.28       | 76. <mark>96</mark>    | 71.87  |
|            | 30             | 850        | 306          | 64                     |        | 280.08       | 67. <mark>05</mark>    |        |
| EWFilter   | 45             | 850        | 306          | 64                     | 64     | 195.84       | 76. <mark>96</mark>    | 72.00  |

Table 2: Power Consumption Results for larger Timing Constraints

higher timing constraints here





- ... is the process of mapping operations from the CDFG to component templates from the RTL library.
- Initially, only a **functional unit template**, not a specific instance, associated with each operation
- Example: "+"-Operation may be implemented using
  - ripple-carry adder (slower, but more switched capacitance efficient)
  - carry-lookahead adder (faster, incures higher switched capacitance)
  - carry-select adder

- ..

- similar tradeoffs for other operations
- Idea: Exploit tradeoffs to fulfill power constraints through module selection

## Module selection for LP (cont'd)



Each operation in the DFG (middle) has been mapped to fast component in order to meet performance constraints (in that case constraint: 85ns)
But is that really necessary? => no, not all ops need necessarily be mapped to fastest module. Focus (for timing constraint) should rather be on critical path
Idea: slack in off-critical path ops may be used to select slower functional units that may have a better efficiency in switched capacitance (see right DFG). There, mult op uses less power (but not less energy)
Important: to have a large module library with distinct switching capacity efficiencies and performance characteristics





- Power consumption in HW:  $P_{avg} = P_{sw. cap.} + P_{short - circuit} + P_{leakage} + P_{static}$
- Operator Scheduling:
  - reread [Sarraf95] at home;
  - understanding notation takes a bit of time

issues in [Sarraf95]:

- τ is used for different things
- $\Phi$  should be substituted by  $\varnothing$
- 'integral'  $\rightarrow$  'integer'



[Heer04] Ch. Herr, U. Schlichtmann, "Ultra-Low-Power Design: Device and logic design approaches", pp. 1-20, in "Ultra Low-Power Electronics and Design" by Kluwer, 2004.

[Anand98] A. Raghunathan, N.K. Jha, S. Dey, "High-level power analysis and optimization", Kluwer Academic Publishers, 1998.

[Sarraf95] S. Raje, M. Sarrafzadeh, "Variable voltage scheduling", IEEE/ACM ISLPED 1995. pp. 9-14, 1995.

[Knight] R.S. Martin, J.P. Knight, "Power-Profiler: Optimizing ASIC's Power Consumption at the behavioral level", Proc. Of IEEE/ACM Design Automation Conf. (DAC'95), pp.42-47,1995.

[Macii04] E. Macii (Ed.), "Ultra Low-Power Electronics and Design", Kluwer Academic Publishers, 2004.

[Devadas] Alidina, M.; Monteiro, J.; Devadas, S.; Ghosh, A.; Papefthymiou, M.; "Precomputation-based Sequential Logic Optimization For Low Power", Computer-Aided Design (ICCAD), 1994., IEEE/ACM International Conference on November 6-10, 1994 Page(s):74 – 81.

[Ragh99] Raghunathan, A.; Dey, S.; Jha, N.K.; "Register transfer level power optimization with emphasis on glitch analysis and reduction", Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on Volume 18, Issue 8, Page(s):1114 – 1131, Aug. 1999.

[Tivari] Tiwari, V.; Malik, S.; Ashar, P.; "Guarded evaluation: pushing power management to logic synthesis/design", Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on Volume 17, Issue 10, Page(s):1051 – 1060, Oct. 1998.

[Mehra] R. Mehra, J. Rabaey, "Exploiting Regularity for Low Power Design", IEEE/ACM Intl' Conference on Computer Aided Design (ICCAD96), pp. 166-172, 1996.

[Keating] Keating, Michael, David Flynn, Rob Aitken, Alan Gibbons, and Kaijian Shi. Low power methodology manual: for system-onchip design. Springer Publishing Company, Incorporated, 2007.

[Kim] Kim, N.S.; Austin, T.; Baauw, D.; Mudge, T.; Flautner, K.; Hu, J.S.; Irwin, M.J.; Kandemir, M.; Narayanan, V., "Leakage current: Moore's law meets static power," Computer , vol.36, no.12, pp.68,75, Dec. 2003